Somatic Mutations

The goal of this notebook is to introduce you to the Somatic Mutations BigQuery table.

This table is based on the open-access somatic mutation calls available in MAF files at the DCC. In addition to uploading all current MAF files from the DCC, the mutations were also annotated using Oncotator. A subset of the columns in the underlying MAF files and a subset of the Oncotator outputs were then assembled in this table.

In addition, the ETL process includes several data-cleaning steps because many tumor types actually have multiple current MAF files and therefore potentially duplicate mutation calls. In some cases, a tumor sample may have had mutations called relative to both a blood-normal and an adjacent-tissue sample, and in other cases MAF files may contain mutations called on more than one aliquot from the same sample. Every effort was made to include all of the available data at the DCC while avoiding having multiple rows in the mutation table describing the same somatic mutation. Note, however, that if the same mutation was called by multiple centers and appeared in different MAF files, it may be described on muliple rows (as you will see later in this notebook). Furthermore, in some cases, the underlying MAF file may have been based on a multi-center mutationa-calling exercise, in which case you may see a list of centers in the Center field, eg "bcgsc.ca;broad.mit.edu;hgsc.bcm.edu;mdanderson.org;ucsc.edu".

In conclusion, if you are counting up the number of mutations observed in a sample or a patient or a tumor-type, be sure to include the necessary GROUP BY clause(s) in order to avoid double-counting!

As usual, in order to work with BigQuery, you need to import the bigquery module (gcp.bigquery) and you need to know the name(s) of the table(s) you are going to be working with:


In [1]:
import gcp.bigquery as bq
somatic_mutations_BQtable = bq.Table('isb-cgc:tcga_201607_beta.Somatic_Mutation_calls')

Let's start by taking a look at the table schema:


In [2]:
%bigquery schema --table $somatic_mutations_BQtable


Out[2]:

That's a lot of fields! Let's dig in a bit further to see what is included in this table. For example let's count up the number of unique patients, tumor-samples, and normal-samples based on barcode identifiers:


In [3]:
%%sql --module count_unique

SELECT COUNT(DISTINCT $f,25000) AS n
FROM $t

In [4]:
fieldList = ['ParticipantBarcode', 'Tumor_SampleBarcode', 'Normal_SampleBarcode' ]
for aField in fieldList:
  field = somatic_mutations_BQtable.schema[aField]
  rdf = bq.Query(count_unique,t=somatic_mutations_BQtable,f=field).to_dataframe()
  print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)


 There are  10581 unique values in the field ParticipantBarcode. 
 There are  10698 unique values in the field Tumor_SampleBarcode. 
 There are  10581 unique values in the field Normal_SampleBarcode. 

Now let's look at a few key fields and find the top-5 most frequent values in each field:


In [5]:
%%sql --module top_5_values

SELECT $f, COUNT(*) AS n
FROM $t
WHERE ( $f IS NOT NULL )
GROUP BY $f
ORDER BY n DESC
LIMIT 5

You can use the parameterized query defined above to find the top-5 most frequently occurring values for any field of interest, for example:


In [6]:
bq.Query(top_5_values,t=somatic_mutations_BQtable,f=somatic_mutations_BQtable.schema['Hugo_Symbol']).results().to_dataframe()


Out[6]:
Hugo_Symbol n
0 Unknown 63962
1 TTN 20183
2 MUC16 11742
3 FLG 11039
4 OBSCN 6967

In [7]:
bq.Query(top_5_values,t=somatic_mutations_BQtable,f=somatic_mutations_BQtable.schema['Center']).results().to_dataframe()


Out[7]:
Center n
0 broad.mit.edu 3798074
1 broad.mit.edu;hgsc.bcm.edu 346100
2 broad.mit.edu;genome.wustl.edu 253889
3 bcgsc.ca 219239
4 bcgsc.ca;broad.mit.edu 218444

In [8]:
bq.Query(top_5_values,t=somatic_mutations_BQtable,f=somatic_mutations_BQtable.schema['Mutation_Status']).results().to_dataframe()


Out[8]:
Mutation_Status n
0 Somatic 5355991

In [9]:
bq.Query(top_5_values,t=somatic_mutations_BQtable,f=somatic_mutations_BQtable.schema['Protein_Change']).results().to_dataframe()


Out[9]:
Protein_Change n
0 p.M1I 1679
1 p.V600E 588
2 p.R132H 533
3 p.M1V 463
4 p.Q269Q 398

Everyone probably recognizes the V600E mutation in the previous result, so let's use that well-known BRAF mutation as a way to explore what other information is available in this table.


In [10]:
%%sql --module find_BRAF_V600E

SELECT
  Tumor_SampleBarcode,
  Study,
  Hugo_Symbol,
  Genome_Change,
  Protein_Change
FROM
  $t
WHERE
  ( Hugo_Symbol="BRAF"
    AND Protein_Change="p.V600E" )
GROUP BY
  Tumor_SampleBarcode,
  Study,
  Hugo_Symbol,
  Genome_Change,
  Protein_Change
ORDER BY
  Study,
  Tumor_SampleBarcode

In [11]:
r = bq.Query(find_BRAF_V600E,t=somatic_mutations_BQtable).results()
r.to_dataframe()


Out[11]:
Tumor_SampleBarcode Study Hugo_Symbol Genome_Change Protein_Change
0 TCGA-YF-AA3L-01A BLCA BRAF g.chr7:140453136A>T p.V600E
1 TCGA-ZU-A8S4-01A CHOL BRAF g.chr7:140453136A>T p.V600E
2 TCGA-A6-2686-01A COAD BRAF g.chr7:140453136A>T p.V600E
3 TCGA-A6-3809-01A COAD BRAF g.chr7:140453136A>T p.V600E
4 TCGA-A6-5661-01A COAD BRAF g.chr7:140453136A>T p.V600E
5 TCGA-A6-5665-01A COAD BRAF g.chr7:140453136A>T p.V600E
6 TCGA-A6-6649-01A COAD BRAF g.chr7:140453136A>T p.V600E
7 TCGA-A6-6653-01A COAD BRAF g.chr7:140453136A>T p.V600E
8 TCGA-AA-3492-01A COAD BRAF g.chr7:140453136A>T p.V600E
9 TCGA-AA-3664-01A COAD BRAF g.chr7:140453136A>T p.V600E
10 TCGA-AA-3684-01A COAD BRAF g.chr7:140453136A>T p.V600E
11 TCGA-AA-3713-01A COAD BRAF g.chr7:140453136A>T p.V600E
12 TCGA-AA-3715-01A COAD BRAF g.chr7:140453136A>T p.V600E
13 TCGA-AA-3815-01A COAD BRAF g.chr7:140453136A>T p.V600E
14 TCGA-AA-3821-01A COAD BRAF g.chr7:140453136A>T p.V600E
15 TCGA-AA-3833-01A COAD BRAF g.chr7:140453136A>T p.V600E
16 TCGA-AA-3877-01A COAD BRAF g.chr7:140453136A>T p.V600E
17 TCGA-AA-3947-01A COAD BRAF g.chr7:140453136A>T p.V600E
18 TCGA-AA-3949-01A COAD BRAF g.chr7:140453136A>T p.V600E
19 TCGA-AA-3966-01A COAD BRAF g.chr7:140453136A>T p.V600E
20 TCGA-AA-A01D-01A COAD BRAF g.chr7:140453136A>T p.V600E
21 TCGA-AA-A01P-01A COAD BRAF g.chr7:140453136A>T p.V600E
22 TCGA-AA-A022-01A COAD BRAF g.chr7:140453136A>T p.V600E
23 TCGA-AA-A02R-01A COAD BRAF g.chr7:140453136A>T p.V600E
24 TCGA-AD-5900-01A COAD BRAF g.chr7:140453136A>T p.V600E
25 TCGA-AD-6889-01A COAD BRAF g.chr7:140453136A>T p.V600E
26 TCGA-AD-A5EJ-01A COAD BRAF g.chr7:140453136A>T p.V600E
27 TCGA-AM-5820-01A COAD BRAF g.chr7:140453136A>T p.V600E
28 TCGA-AM-5821-01A COAD BRAF g.chr7:140453136A>T p.V600E
29 TCGA-AU-6004-01A COAD BRAF g.chr7:140453136A>T p.V600E
... ... ... ... ... ...
557 TCGA-IM-A41Y-01A THCA BRAF g.chr7:140453136A>T p.V600E
558 TCGA-IM-A420-01A THCA BRAF g.chr7:140453136A>T p.V600E
559 TCGA-IM-A4EB-01A THCA BRAF g.chr7:140453136A>T p.V600E
560 TCGA-J8-A3NZ-01A THCA BRAF g.chr7:140453136A>T p.V600E
561 TCGA-J8-A3O2-01A THCA BRAF g.chr7:140453136A>T p.V600E
562 TCGA-J8-A3O2-06A THCA BRAF g.chr7:140453136A>T p.V600E
563 TCGA-J8-A3YD-01A THCA BRAF g.chr7:140453136A>T p.V600E
564 TCGA-J8-A3YE-01A THCA BRAF g.chr7:140453136A>T p.V600E
565 TCGA-J8-A3YF-01A THCA BRAF g.chr7:140453136A>T p.V600E
566 TCGA-J8-A3YG-01A THCA BRAF g.chr7:140453136A>T p.V600E
567 TCGA-J8-A3YH-01A THCA BRAF g.chr7:140453136A>T p.V600E
568 TCGA-J8-A3YH-06A THCA BRAF g.chr7:140453136A>T p.V600E
569 TCGA-J8-A42S-01A THCA BRAF g.chr7:140453136A>T p.V600E
570 TCGA-J8-A4HY-01A THCA BRAF g.chr7:140453136A>T p.V600E
571 TCGA-KS-A41F-01A THCA BRAF g.chr7:140453136A>T p.V600E
572 TCGA-KS-A41J-01A THCA BRAF g.chr7:140453136A>T p.V600E
573 TCGA-KS-A4I1-01A THCA BRAF g.chr7:140453136A>T p.V600E
574 TCGA-KS-A4I3-01A THCA BRAF g.chr7:140453136A>T p.V600E
575 TCGA-KS-A4I5-01A THCA BRAF g.chr7:140453136A>T p.V600E
576 TCGA-KS-A4I7-01A THCA BRAF g.chr7:140453136A>T p.V600E
577 TCGA-KS-A4I9-01A THCA BRAF g.chr7:140453136A>T p.V600E
578 TCGA-KS-A4IB-01A THCA BRAF g.chr7:140453136A>T p.V600E
579 TCGA-KS-A4IC-01A THCA BRAF g.chr7:140453136A>T p.V600E
580 TCGA-L6-A4EP-01A THCA BRAF g.chr7:140453136A>T p.V600E
581 TCGA-L6-A4EQ-01A THCA BRAF g.chr7:140453136A>T p.V600E
582 TCGA-L6-A4ET-01A THCA BRAF g.chr7:140453136A>T p.V600E
583 TCGA-L6-A4EU-01A THCA BRAF g.chr7:140453136A>T p.V600E
584 TCGA-MK-A4N6-01A THCA BRAF g.chr7:140453136A>T p.V600E
585 TCGA-MK-A4N7-01A THCA BRAF g.chr7:140453136A>T p.V600E
586 TCGA-MK-A4N9-01A THCA BRAF g.chr7:140453136A>T p.V600E

587 rows × 5 columns

Let's count these mutations up by study (tumor-type):


In [12]:
%%sql

SELECT Study, COUNT(*) AS n
FROM $r
GROUP BY Study
HAVING n > 1
ORDER BY n DESC


Out[12]:
Studyn
THCA300
SKCM212
COAD50
LUAD10
GBM6
KIRP2
LGG2

(rows: 7, time: 0.8s, 3KB processed, job: job_EhSIiJqwnPY5ciQgmSyhTsTvNWU)

You may have noticed that in our earlier query, we did a GROUP BY to make sure that we didn't count the same mutation called on the same sample more than once. We might want to GROUP BY patient instead to see if that changes our counts -- we may have multiple samples from some patients.


In [13]:
%%sql --module find_BRAF_V600E_by_patient

SELECT
  ParticipantBarcode,
  Study,
  Hugo_Symbol,
  Genome_Change,
  Protein_Change
FROM
  $t
WHERE
  ( Hugo_Symbol="BRAF"
    AND Protein_Change="p.V600E" )
GROUP BY
  ParticipantBarcode,
  Study,
  Hugo_Symbol,
  Genome_Change,
  Protein_Change
ORDER BY
  Study,
  ParticipantBarcode

In [14]:
r = bq.Query(find_BRAF_V600E_by_patient,t=somatic_mutations_BQtable).results()

In [15]:
%%sql

SELECT Study, COUNT(*) AS n
FROM $r
GROUP BY Study
HAVING n > 1
ORDER BY n DESC


Out[15]:
Studyn
THCA294
SKCM211
COAD50
LUAD9
GBM6
KIRP2
LGG2

(rows: 7, time: 0.8s, 3KB processed, job: job_wEBpxuzXwUPQ4A919iDBpYN7Zo4)

When we counted the number of mutated samples, we found 261 THCA samples, but when we counted the number of patients, we found 258 THCA patients, so let's see what's going on there.


In [16]:
%%sql

SELECT
  ParticipantBarcode,
  COUNT(*) AS m
FROM (
  SELECT
    ParticipantBarcode,
    Tumor_SampleBarcode,
    COUNT(*) AS n
  FROM
    $somatic_mutations_BQtable
  WHERE
    ( Hugo_Symbol="BRAF"
      AND Protein_Change="p.V600E"
      AND Study="THCA" )
  GROUP BY
    ParticipantBarcode,
    Tumor_SampleBarcode,
    )
GROUP BY
  ParticipantBarcode
HAVING
  m > 1
ORDER BY
  m DESC


Out[16]:
ParticipantBarcodem
TCGA-J8-A3YH2
TCGA-EM-A2P12
TCGA-EM-A3SU2
TCGA-J8-A3O22
TCGA-EM-A2CS2
TCGA-DE-A4MD2

(rows: 6, time: 1.0s, 286MB processed, job: job_77CJKLhHZId2oTUdPcV9wfrAeoA)

Sure enough, we see that the same mutation is reported twice for each of these three patients. Let's look at why:


In [17]:
%%sql

SELECT
  ParticipantBarcode,
  Tumor_SampleBarcode,
  Tumor_SampleTypeLetterCode,
  Normal_SampleBarcode,
  Normal_SampleTypeLetterCode,
  Center,
FROM
  $somatic_mutations_BQtable
WHERE
  ( Hugo_Symbol="BRAF"
    AND Protein_Change="p.V600E"
    AND Study="THCA"
    AND ParticipantBarcode="TCGA-EM-A2P1" )
ORDER BY
  Tumor_SampleBarcode,
  Normal_SampleBarcode,
  Center


Out[17]:
ParticipantBarcodeTumor_SampleBarcodeTumor_SampleTypeLetterCodeNormal_SampleBarcodeNormal_SampleTypeLetterCodeCenter
TCGA-EM-A2P1TCGA-EM-A2P1-01ATPTCGA-EM-A2P1-10ANBbcgsc.ca;broad.mit.edu;hgsc.bcm.edu
TCGA-EM-A2P1TCGA-EM-A2P1-06ATMTCGA-EM-A2P1-10ANBbcgsc.ca;broad.mit.edu;hgsc.bcm.edu

(rows: 2, time: 1.1s, 524MB processed, job: job_BNWbHokrmzPc3fY8PfCPiHjiSTM)

Aha! not only did this patient provide both a primary tumor (TP) and a metastatic (TM) sample, but we have mutation calls from three different centers.

Finally, let's pick out one of these mutations and see what some of the other fields in this table can tell us:


In [18]:
%%sql

SELECT
  ParticipantBarcode,
  Tumor_SampleTypeLetterCode,
  Normal_SampleTypeLetterCode,
  Study,
  Center,
  Variant_Type,
  Variant_Classification,
  Genome_Change,
  cDNA_Change,
  Protein_Change,
  UniProt_Region,
  COSMIC_Total_Alterations_In_Gene,
  DrugBank
FROM
  $somatic_mutations_BQtable
WHERE
  ( Hugo_Symbol="BRAF"
    AND Protein_Change="p.V600E"
    AND Study="THCA"
    AND ParticipantBarcode="TCGA-EM-A2P1"
    AND Tumor_SampleTypeLetterCode="TP"
    AND Center="broad.mit.edu" )


Out[18]:

(rows: 0, time: 0.8s, 778MB processed, job: job__anjZ4ojpEEynW7YDH6LTbgaQoc)

When working with variants or mutations, there is another public BigQuery table that you might find useful. Developed by Tute Genomics, this comprehensive, publicly-available database of over 8.5 billion known variants was announced earlier this year. This table includes several types of annotations and scores, such ase Polyphen2 and MutationAssessor, and a proprietary "Tute score" which estimates whether a SNP or indel is likely to be associate with Mendelian phenotypes.

For example, you can look up all exonic BRAF mutations in the TuteTable in less than 20 seconds:


In [19]:
%%sql --module BRAF_TUTE

SELECT
  Chr,
  Start,
  Func,
  Gene,
  AA,
  Polyphen2_HDIV_score,
  Polyphen2_HVAR_score,
  MutationAssessor_score,
  TUTE
FROM
  [silver-wall-555:TuteTable.hg19]
WHERE
  ( Gene="BRAF"
    AND Func="exonic" )
ORDER BY
  Start ASC

In [20]:
tuteBRAFscores = bq.Query(BRAF_TUTE).results().to_dataframe()
tuteBRAFscores.describe()


/usr/local/lib/python2.7/dist-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[20]:
Start Polyphen2_HDIV_score Polyphen2_HVAR_score MutationAssessor_score TUTE
count 6.903000e+03 4985.000000 4985.000000 4976.000000 6903.000000
mean 1.404945e+08 0.556226 0.413867 1.247881 0.332186
std 4.577719e+04 0.417165 0.395638 1.182233 0.306410
min 1.404344e+08 0.000000 0.000000 -2.770000 0.000000
25% 1.404540e+08 NaN NaN NaN 0.108000
50% 1.404874e+08 NaN NaN NaN 0.196700
75% 1.405087e+08 NaN NaN NaN 0.568100
max 1.406245e+08 1.000000 1.000000 4.505000 0.931700

Let's go back to the TCGA somatic mutations table and pull out all BRAF mutations and then join them with the matching mutations in the Tute Table so that we can compare the distribution of scores (eg MutationAssessor and TUTE) between the somatic mutations seen in TCGA and the larger set of variants contained in the Tute Table.


In [21]:
%%sql --module TCGA_BRAF

SELECT
  Hugo_Symbol,
  Protein_Change,
  MutationAssessor_score,
  TUTE
FROM (
  SELECT
    Hugo_Symbol,
    Protein_Change
  FROM
    $t
  WHERE
    ( Hugo_Symbol="BRAF" )
  GROUP BY
    Hugo_Symbol,
    Protein_Change ) AS tcga
JOIN (
  SELECT
    Gene,
    AA,
    MutationAssessor_score,
    TUTE
  FROM
    [silver-wall-555:TuteTable.hg19]
  WHERE
    ( Gene="BRAF" ) ) AS tute
ON
  tcga.Hugo_Symbol=tute.Gene
  AND tcga.Protein_Change=tute.AA

In [22]:
tcgaBRAFscores = bq.Query(TCGA_BRAF,t=somatic_mutations_BQtable).results().to_dataframe()
tcgaBRAFscores.describe()


Out[22]:
MutationAssessor_score TUTE
count 194.000000 298.000000
mean 1.531443 0.337101
std 1.376454 0.321675
min -1.285000 0.000000
25% NaN 0.096200
50% NaN 0.188800
75% NaN 0.645875
max 4.475000 0.928100

In [23]:
import numpy as np
import matplotlib.pyplot as plt

In [24]:
plt.hist(tuteBRAFscores['TUTE'],bins=50,normed=True,color='red',alpha=0.6,label='all variants');
plt.hist(tcgaBRAFscores['TUTE'],bins=50,normed=True,color='blue',alpha=0.4,label='TCGA somatic mutations');
plt.legend(loc='upper right');
plt.xlabel('TUTE score');
plt.ylabel('probability');



In [25]:
plt.hist(tuteBRAFscores['MutationAssessor_score'],bins=45,range=[-4,5],normed=True,color='red',alpha=0.6,label='all variants');
plt.hist(tcgaBRAFscores['MutationAssessor_score'],bins=45,range=[-4,5],normed=True,color='blue',alpha=0.4,label='TCGA somatic mutations');
plt.legend(loc='upper right');
plt.xlabel('MutationAssessor score');
plt.ylabel('probability');


Both of these plots suggest that some of the somatic BRAF mutations observed in TCGA tumor samples are scored as more deleterious by both TUTE and MutationAssessor. In the TUTE histogram, a larger fraction of somatic mutations also get a score of 0.

Note that in these histograms, each count represents a single variant, ie a specific protein change. Mutations that are seen across multiple patients are not being counted multiple times.


In [ ]: